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Abstract 

We propose a novel Bayesian approach to solve stochastic optimization problems 
that involve fnding extrema of noisy, nonlinear functions. Previous work has fo- 
cused on representing possible functions explicitly, which leads to a two-step pro- 
cedure of first, doing inference over the function space and second, finding the 
extrema of these functions. Here we skip the representation step and directly 
model the distribution over extrema. To this end, we devise a non-parametric 
conjugate prior based on a kernel regressor. The resulting posterior distribution 
directly captures the uncertainty over the maximum of the unknown function. We 
illustrate the effectiveness of our model by optimizing a noisy, high-dimensional, 
non-convex objective function. 



1 Introduction 



Historically, the fields of statistical inference and stochastic optimization have often developed their 
own specific methods and approaches. Recently, however, there has been a growing interest in 
applying inference-based methods to optimization problems and vice versa 

GUI. Here we consider 

stochastic optimization problems where we observe noise-contaminated values from an unknown 
nonlinear function and we want to find the input that maximizes the expected value of this function. 

The problem statement is as follows. Let X be a metric space. Consider a stochastic function 
/ : X K mapping a test point x G X to real values y G M characterized by the conditional pdf 
P(y\x). Consider the mean function 

f{x) :=ny\x] = J yP(y\x)dy. (1) 

The goal consists in modeling the optimal test point 

x* := argmax{/(j;)}. (2) 
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Figure 1 : a) Given an estimate h of the mean function / (left), a simple probability density function 
over the location of the maximum x* is obtained using the transformation P(x*) oc exp{ah(x*)}, 
where a > plays the role of the precision (right), b) Illustration of the Gramian matrix for different 
test locations. Locations thar are close to each other produce large off-diagonal entries. 




Classic approaches to solve this problem are often based on stochastic approximation methods Q5D . 
Within the context of statistical inference, Bayesian optimization methods have been developed 
where a prior distribution over the space of functions is assumed and uncertainty is tracked during 
the entire optimization process BalU. In particular, non-parametric Bayesian approaches such as 
Gaussian Processes have been applied for derivative-free optimization JalSD, also within the context 
of the continuum-armed bandit problem [ 10]. Typically, these Bayesian approaches aim to explicitly 
represent the unknown objective function of ([T) by entertaining a posterior distribution over the 
space of objective functions. In contrast, we aim to model directly the distribution of the maximum 
of (O conditioned on observations. 

The paper is structured as follows. Section|2]gives a brief description of the model suitable for direct 
implementation. The model is then derived in Section [3] Section [4] presents experimental results. 
Section|4]concludes. 



2 Description of the Model 

Our model is intuitively straightforward and easy to implemenfl Let h(x) : X — > K be an estimate 
of the mean f(x) constructed from data T> t :— {(xi, ?/i)}i=i (FigureQJ, left). This estimate can 
easily be converted into a posterior pdf over the location of the maximum by first multiplying it with 
a precision parameter a > and then taking the normalized exponential (Figure QJ, right) 

P{x*\T>t) oc exp{a • h(x*)}. 
In this transformation, the precision parameter a controls the certainty we have over our estimate of 
the maximizing argument: a w expresses almost no certainty, while a — >• oc expresses certainty. 
The rationale for the precision is: the more distinct inputs we test, the higher the precision — testing 
the same (or similar) inputs only provides local information and therefore should not increase our 
knowledge about the global maximum. A simple and effective way of implementing this idea is 
given by 

WlPjcccxpfp (c I t Ei^EhEil ) Ei gfo **)* + *o(**)tt>(**) ] (3) 



effective # of locations estimate of / (x * ) 

where p, £, K, Kq and j/o are parameters of the estimator: p > is the precision we gain for 
each new distinct observation; £ > is the number of prior points; K : X x X — > IR + is a finite, 
symmetric kernel function; Ko : X — > R + is a prior precision function; and yo : X — > K is a prior 
estimate of /. 

In (O, the mean function / is estimated with a kernel regressor [11], and the total effective number 
of locations is calculated as the sum of the prior locations £ and the number of distinct locations in 
the data V t . The latter is estimated by multiplying the number of data points t with the coefficient 

Y,iK{xi,x.i) 



G (0,1], 



'implementations can be downloaded from http://www.adaptiveagents.org/argmaxprior 
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i.e. the ratio between the trace of the Gramian matrix (K(xi,Xj))ij and the sum of its entries. Inputs 
that are very close to each other will have overlapping kernels, resulting in large off-diagonal entries 
of the Gramian matrix — hence decreasing the number of distinct locations (FigureUJi). 

The expression for the posterior can be calculated, up to a constant factor, in quadratic time in 
the number of observations. It can therefore be easily combined with Markov chain Monte Carlo 
methods (MCMC) to implement stochastic optimizers as illustrated in Section|4] 



3 Derivation of the Model 
3.1 Function-Based, Indirect Model 

Our first task is to derive an indirect Bayesian model for the optimal test point that builds its estimate 
via the underlying function space. Let Q be the set of hypotheses, and assume that each hypothesis 
g G G corresponds to a stochastic mapping g : X ~-> R. Let P(g) be the priofl over Q and let the 
likelihood be P({y t }\g, {xt}) = Ylt P{Vt\9i x t)- Then, the posterior of g is given by 

p ({yt}\{xt}) P({yt}\{x t }) 

For each x* G X, let Q(x*) C Q be the subset of functions such that for all g G Q[x*), x* — 
a,rgmax x {g(x)^. Then, the posterior over the optimal test point x* is given by 



P(x*\{y t },{x t })= / P(g\{y t },{x t })dg, (5) 

Jg{x*) 

This model has two important drawbacks: (a) it relies on modeling the entire function space Q, 
which is potentially much more complex than necessary; (b) it requires calculating the integral (|5}, 
which is intractable for virtually all real-world problems . 



3.2 Domain-Based, Direct Model 

We want to arrive at a Bayesian model that bypasses the integration step suggested by (0 and directly 
models the location of optimal test point x* . The following theorem explains how this direct model 
relates to the previous model. 

Theorem 1. The Bayesian model for the optimal test point x* is given by 

P{x*)= [ P{g)dg (prior) 

Jg(x-) 

Jg(x') p (yt\g> Xt)P(g) E[fc=i p (Vk\g, x k ) dg 
fg(x*) P (-9) UtJi p {Vk\g, %k) dg 
where T>t '■= {(%k, 2/fe)}fc=l ' s ^ set of past tests. 



P(y t \x* ,x t ,V t _x) = — - i _ rt _ 1 — — ■ — , (likelihood) 



Proof. Using Bayes' rule, the posterior distribution P(x*\{yt}, {xt}) can be rewritten as 

P{.x*)Y\ t P(y t \x\x u V t ^) 

P({yt}\{x t }) 

Since this posterior is equal to (|5), one concludes (using dU) that 



(6) 



P(x*)T[P(y t \ X *,x u V t _ 1 )= [ P(g)]l 



P(yt\g,x t )dg. 



Note that this expression corresponds to the joint P(x* , {yt}\{ x t})- The prior P(x*) is obtained by 
setting t = 0. The likelihood is obtained as the fraction 

p(x*,{y fc }UK^}Ui) 



P(yt\x*,xt,V t -i) = 



^*,{wH=il{^}fc=i)' 



2 For the sake of simplicity, we neglect issues of measurability of Q. 

3 Note that we assume that the mean function g is bounded and that it has a unique maximizing test point. 
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Figure 2: Illustration of assumptions, a) Three functions from Q(x*). They all have their maximum 
located at x* € X. b) Schematic representation of the likelihood function of x* G X conditioned 
on a few observations. The curve corresponds to the mean and the shaded area to the confidence 
bounds. The density inside of the neighborhood is unique to the hypothesis x*, while the density 
outside is shared amongst all the hypotheses, c) The log-likelihood ratio of the hypotheses x* and 
x*2 as a function of the test point x. The kernel used in the plot is Gaussian. 



where it shall be noted that the denominator P(x* , {ykY^Ji li^fejfc^i) doesn't change if we add the 
condition x t . □ 

From Theorem[T]it is seen that although the likelihood model P(yt\g, Xt) for the indirect model 
is i.i.d. at each test point, the likelihood model P(y t \x*, x t , 2?t-i) for the direct model depends 
on the past tests Pt_i, that is, it is adaptive. More critically though, the likelihood function's 
internal structure of the direct model corresponds to an integration over function space as well — 
thus inheriting all the difficulties of the indirect model. 

3.3 Abstract Properties of the Likelihood Function 

There is a way to bypass modeling the function space explicitly if we make a few additional as- 
sumptions. We assume that for any g G G(x*), the mean function g is continuous and has a unique 
maximum. Then, the crucial insight consists in realizing that the value of the mean function g inside 
a sufficiently small neighborhood of x* is larger than the value outside of it (see Figure |2^). 

We assume that, for any <5 > and any z G X, let E>s(z) denote the open J-ball centered on z. The 
functions in Q fulfill the following properties: 

a. Continuous: Every function g G Q is such that its mean g is continuous and bounded. 

b. Maximum: For any x* G X, the functions g G G(x*) are such that for all S > and all 

z £ B s {x*),g{x*) > g(z). 

Furthermore, we impose a symmetry condition on the likelihood function. Let x\ and x\ be in X, 
and consider their associated equivalence classes G{x\) and ^(a;^)- There is no reason for them to 
be very different: in fact, they should virtually be indistinguishable outside of the neighborhoods 
of x\ and x^- It is only inside of the neighborhood of x\ when G{x\) becomes distinguishable from 
the other equivalence classes because the functions in G{x\) systematically predict higher values 
than the rest. This assumption is illustrated in Figure |2j). In fact, taking the log-likelihood ratio of 
two competing hypotheses 

P(y t \xl,Xt,T>t-i) 
P(yt\x%,xt,T>t-i) 

for a given test location x t should give a value equal to zero unless Xt is inside of the vicinity of x\ 
or x% (see Figure[2j;). In other words, the amount of evidence a hypothesis gets when the test point 
is outside of its neighborhood is essentially zero (i.e. it is the same as the amount of evidence that 
most of the other hypotheses get). 
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3.4 Likelihood and Conjugate Prior 

Following our previous discussion, we propose the following likelihood model. Given the previous 
data T>t-i and a test point xt E X, the likelihood of the observation yt is 

P(Vt\x*,x u V t _x) = TT, ^ A-i)exp{a t • h t (x*) - a t _ x • h t -i(x*)\, (7) 

where: Z(x t ,T> t ~i) is a normalizing constant; \(y t \x t , Dt—i) i s a posterior probability over y t 
given x t and the data T) t ~\\ at is a precision measuring the knowledge we have about the whole 
function given by 

J2 i K(x l ,x l ) 



a 



p-£ and a t := p ■ U 



where p > is a precision scaling parameter; £ > is a parameter representing the number prior 
locations tested; and h t is an estimate of the mean function / given by 

, / *x / *s A u t *\ T,l=i K ( x h x *)yi + K (x*)y (x*) 
ho(x ) := y (x ) and h t (x ) := — ■ 

In the last expression, y corresponds to a prior estimate of / with prior precision K . Inspecting (0, 
we see that the likelihood model favours positive changes to the estimated mean function/rom new, 
unseen test locations. The pdf X(y t \xt,'Dt-i) does not need to be explicitly defined, as it will later 
drop out when computing the posterior. The only formal requirement is that it should be independent 
of the hypothesis x*. 

We propose the conjugate prior 

P(x*) = — exp{a • go{x*)} = — cxp{£ -y (x*)}. (8) 

The conjugate prior just encodes a prior estimate of the mean function. In a practical optimization 
application, it serves the purpose of guiding the exploration of the domain, as locations x* with high 
prior value yo(x*) are more likely to contain the maximizing argument. 

Given a set of data points 2? t , the prior (0 and the likelihood d7} lead to a posterior given by 

^(**)nLi p Wz%zfc>2>fc-i) 



P{x*\V t ) 



fx p ( x ') Iffc=i P{Vk\ x ', x h ,V k _ x ) dx> 

ex P{Z)l=i a k ■ h k {x*) - a fe _i • h k -i{x*)}Z^ 1 Yik=i Zixk^k-x)- 1 
J* ex P{S!=i a k ■ hk(x') - a k -i ■ hk-iix'^Z^ 1 n' fc= i Zixk.Vk-iy 1 dx' 

exp{a t • h t (x*)} 
J x exp{a t • h t (x')} dx' 



(9) 



Thus, the particular choice of the likelihood function guarantees an analytically compact posterior 
expression. In general, the normalizing constant in ((9) is intractable, which is why the expression is 
only practical for relative comparisons of test locations. Substituting the precision a t and the mean 
function estimate h t yields 

I V 22i22j K ( x i> x 3)J 2Zi K K x h x *) + K (x*) 
4 Experimental Results 
4.1 Parameters. 

We have investigated the influence of the parameters on the resulting posterior probability distribu- 
tion. Figure[3]shows how the choice of the precision p and the kernel width a affect the shape of the 
posterior probability density. We have used the Gaussian kernel 



K(x,x*) = exp|-^(x - a;*) 2 j. 



(10) 
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Figure 3: Effect of the change of parameters on the posterior density over the location of the max- 
imizing test point. Panel (a) shows the 7 data points drawn from the noisy function (solid curve). 
Panel (b) shows the effect of diminishing the precision on the posterior, where solid and shaded 
curves correspond to p = 0.2 and p = 0.1 respectively. Panel (c) shows the effect of increasing 
the width of the kernel (here, Gaussian). The solid and dotted curves correspond to u = 0.01 and 
a = 0.1 respectively. 



In this figure, 7 data points are shown, which were drawn as y ~ N (f(x), 0.3), where the mean 
function is 

f(x) = cos(2x + §7r) + sin(6x + §tt). (1 1) 

The functions K n and y were chosen as 

K (x) = 1 and y (x) = --^(a; - p ) 2 , (12) 

where the latter corresponds to the logarithm of a Gaussian with mean po = 1.5 and vari- 
ance Uq = 5. Choosing a higher value for p leads to sharper updates, while higher values for the 
kernel width a produce smoother posterior densities. 

4.2 Application to Optimization. 

Comparison to Gaussian Process UCB. We have used the model to optimize the same func- 
tion (fTTT > as in our preliminary tests but with higher additive noise equal to one. This is done by sam- 
pling the next test point xt directly from the posterior density over the optimum location P(x* \T>t), 
and then using the resulting pair (act, Vt) to recursively update the model. Essentially, this procedure 
corresponds to Bayesian control rule/Thompson sampling ||T2l,|l3ll . 

We compared our method against a Gaussian Process optimization method using an upper con- 
fidence bound (UCB) criterion ifToll . The parameters for the GP-UCB were set to the following 
values: observation noise a n = 0.3 and length scale £ — 0.3. For the constant that trades off ex- 
ploration and exploitation we followed Theorem 1 in [10] which states f3 t — 2 log(\D\t 2 ir 2 /65) 
with 5 = 0.5. We have implemented our proposed method with a Gaussian kernel as in ( fTOb with 
width a 2 = 0.05. The prior sufficient statistics are exactly as in (TT2V The precision parameter was 
set to p — 0.3. 

Simulation results over ten independent runs are summarized in Figure [4] We show the time- 
averaged observation values y of the noisy function evaluated at test locations sampled from the 
posterior. Qualitatively, both methods show very similar convergence (on average), however our 
method converges faster and with a slightly higher variance. 

High-Dimensional Problem. To test our proposed method on a challenging problem, we have 
designed a non-convex, high-dimensional noisy function with multiple local optima. This Noisy 
Ripples function is defined as 

f( x ) = - jmW x ~ ^H 2 + cos (f % W X ~ Mil) 
where p e X is the location of the global maximum, and where observations have additive Gaussian 
noise with zero mean and variance 0.1. The advantage of this function is that it generalizes well to 
any number of dimensions of the domain. Figure [5^ illustrates the function for the 2-dimensional 
input domain. This function is difficult to optimize because it requires averaging the noisy observa- 
tions and smoothing the ridged landscape in order to detect the underlying quadratic form. 

We optimized the 50-dimensional version of this function using a Metropolis-Hastings scheme to 
sample the next test locations from the posterior over the maximizing argument. The Markov chain 
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Figure 4: Observation values obtained by sampling from the posterior over the maximizing argument 
(left panel) and according to GP-UCB (right panel). The solid blue curve corresponds to the time- 
averaged function value, averaged over ten runs. The gray area corresponds to the error bounds 
(1 standard deviation), and the dashed curve in red shows the time-average of a single run. 




Figure 5: a) The Noisy Ripples objective function in 2 dimensions, b) The time-averaged value and 
the regret obtained by the optimization algorithm on a 50-dimensional version of the Noisy Ripples 
function. 



was started at [20, 20, • • • , 20] T , executing 120 isotropic Gaussian steps of variance 0.07 before 
the point was used as an actual test location. For the arg-max prior, we used a Gaussian kernel 
with lengthscale I — 2, precision factor p = 1.5, prior precision Kq(x*) — 1 and prior mean 
estimate yo(x*) — — TcKTo 1 1 + 5|| 2 . The goal p was located at the origin. 

The result of one run is presented in Figure |5j). It can be seen that the optimizer manages to quickly 
(« 100 samples) reach near-optimal performance, overcoming the difficulties associated with the 
high-dimensionality of the input space and the numerous local optima. Crucial for this success was 
the choice of a kernel that is wide enough to accurately estimate the mean function. The authors are 
not aware of any method capable of solving a problem is similar characteristics. 



5 Discussion & Conclusions 



We have proposed a novel Bayesian approach to model the location of the maximizing test point of a 
noisy, nonlinear function. This has been achieved by directly constructing a probabilistic model over 
the input space, thereby bypassing having to model the underlying function space — a much harder 
problem. In particular, we derived a likelihood function that belongs to the exponential family by 
assuming a form of symmetry in function space. This in turn, enabled us to state a conjugate prior 
distribution over the optimal test point. 
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Our proposed model is computationally very efficient when compared to Gaussian process-based 
(cubic) or UCB-based models (expensive computation of arg max). The evaluation time of the 
posterior density scales quadratically in the size of the data. This is due to the calculation of the 
effective number of previously seen test locations — the kernel regressor requires linear compuation 
time. However, during MCMC steps, the effective number of test locations does not need to be 
updated as long as no new observations arrive. 

In practice, one of the main difficulties associated with our proposed method is the choice of the 
parameters. As in any kernel-based estimation method, choosing the appropriate kernel bandwidth 
can significantly change the estimate and affect the performance of optimizers that rely on the model. 
There is no clear rule on how to choose a good bandwidth. 

In a future research, it will be interesting to investigate the theoretical properties of the proposed 
nonparametric model, such as the convergence speed of the estimator and its relation to the extensive 
literature on active learning and bandits. 
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